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Abstract 

We introduce a novel algorithm for approximating the logarithm of the determinant of a sym¬ 
metric positive definite (SPD) matrix. The algorithm is randomized and approximates the 
traces of a small number of matrix powers of a specially constructed matrix, using the method 
of Avron and Toledo f ATII I . From a theoretical perspective, we present additive and relative 
error bounds for our algorithm. Our additive error bound works for any SPD matrix, whereas 
our relative error bound works for SPD matrices whose eigenvalues lie in the interval (A, 1), 
with 0 < 9 1 < 1; the latter setting was proposed in II IMS 15 1. From an empirical perspective, 
we demonstrate that a C++ implementation of our algorithm can approximate the logarithm 
of the determinant of large matrices very accurately in a matter of seconds. 


1 Introduction 

Given a matrix A e M nxn , the determinant of A, denoted by det(A), is one of the most impor¬ 
tant quantities associated with A. Since its invention by Cardano and Leibniz in the late 16th 
century, the determinant has been a fundamental mathematical concept with countless applica¬ 
tions in numerical linear algebra and scientific computing. The advent of Big Data, which are 
often represented by matrices, increased the applicability of algorithms that compute, exactly or 
approximately, matrix determinants; see, for example, ILZL05UZLLW081IZL07l[dBEG081lHSD + 13l 
for machine learning applications (e.g., gaussian process regression) and [LPOll IKL131IFHT081 
PB97. PBGS00 I for several data mining applications (e.g., spatial-temporal time series analysis). 

Formal definitions of the determinant include the well-known formulas derived by Leibniz 
and Laplace; however, neither the Laplace nor the Leibniz formula can be used to design an ef¬ 
ficient, polynomial-time, algorithm to compute the determinant of A. To achieve this goal, one 
should rely on other properties of the determinant. For example, a standard approach would be 
to leverage the so-called LU matrix decomposition or the Cholesky decomposition for symmetric 
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positive definite matrices (SPD) to get an 0(n 3 ) deterministic algorithm to compute the determi¬ 
nant of A. (Recall that an SPD matrix is a symmetric matrix with strictly positive eigenvalues.) 

In this paper, we are interested in approximating the logarithm of the determinant of a sym¬ 
metric positive definite (SPD) matrix A. The logarithm of the determinant, instead of the determi¬ 
nant itself, is important in several settings HLZL05l[ZLLW08llZL07lldBEG08llHSD + 13llLP01llKL13l 
IFHT081 IFE971 IPBGSOfll . 


Definition 1. [LogDet Problem definition] Given an SPD matrix A e M nxn , compute, exactly or 
approximately, logdet (A). 

Note that since all the eigenvalues of A are strictly positive, the determinant of A is strictly pos¬ 
itive. The best exact algorithm for the above problem simply computes the determinant of A in 
cubic time and takes its logarithm. Few approximation algorithms have appeared in the literature, 
but they either lack a proper theoretical convergence analysis or do not work for all SPD matrices. 
We will discuss prior work in detail in Section [L2| 


1.1 Our contributions 

We present a fast approximation algorithm for the problem of Definition [lj Our main algorithm 
(Algorithm [3j is randomized and its running time is 

O (nnz(A) ( me ~ 2 + logn) log(l/<5)) , 


where nnz(A) denotes the number of non-zero elements in A, 0 < 5 < 1 denotes the failure 
probability of our algorithm, and (integer) m > 0 and (real) e > 0 are user-controlled accuracy 
parameters that are specified in the input of the algorithm. The first step of our approximation 
algorithm uses the power method to compute an approximation to the dominant eigenvalue of A. 
This value will be used in a normalization (preconditioning) step in order to compute a convergent 
matrix-Taylor expansion. The second step of our algorithm leverages a truncated matrix-Taylor 
expansion of a suitably constructed matrix in order to compute an approximation of the log deter¬ 
minant. This second step leverages a randomized trace estimation algorithm from PATll l. 

Let logdet (A) be the value returned by our approximation algorithm (Algorithm 3 J; let logdet (A) 
be the true log determinant of A; let A, (A) denote the i-th eigenvalue of A for alii = 1,..., n with 
A 1 (A)>A 2 (A)>...>A n (A) > 0; and let k (A) = Ai(A)/A n (A) be the condition number of A. 
Our main result, proven in Lemma [6j is that if 


m > 


7k (A) log 


( 1 ) 


then, with probability at least 1 — 25, 


logdet (A) — logdet (A) 


< 2eT, 


( 2 ) 


where 


r = £iog 

i=l 
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We now take a careful look at the above approximation bound. First, given our choice of m in 
eqn. ([lj, the running time of the algorithm becomes 

O (nnz(A) (k (A) log (1/e) e~ 2 + logn) log(l/<5)) . (3) 

Thus, the running time of our algorithm increases linearly with the condition number of A. The 
error of our algorithm scales with T, a quantity that is not immediately comparable to logdet (A). It 
is worth noting that the T term increases logarithmically with respect to the ratios Ai(A)/Aj(A) > 1. 
An obvious, but potentially loose upper bound for the sum of those ratios, is 

r = X] lo g ( 7 • < ra ■ log (7 k(A)) . (4) 


Our second result handles the family of SPD matrices whose eigenvalues all lie in the interval 
(flu 1)/ with 0 < 0i < 1; this setting was proposed in IHMS15I . In this case, a simplified version 
of Algorithm [3] returns a relative error approximation to the log-determinant of the input matrix. 
Indeed, Lemma [8] proves that, with probability at least 1 — 5, 


logdet (A) — logdet (A) 


< 2e|logdet (A) 


The running time of the simplified algorithm is 

o (V£)MW„„, (a) ) . (5) 

Finally, we implemented our algorithm in C++ and tested it on several large dense and sparse 
matrices. Our dense implementation runs on top of Elemental llPMVdG+131 . a linear algebra 
library for distributed matrix computations with dense matrices. Our sparse implementation runs 
on top of Eigen Q a software library for sparse matrix computations. Our code is available to 
download on Github (see Section[5]for details and a link to our code). 


1.2 Related Work 

The most relevant result to ours is the work in 1BP99H . Barry and Pace IBP991 described a ran¬ 
domized algorithm for approximating the logarithm of the determinant of a matrix with special 
structure that we will describe below. They show that in order to approximate the logarithm of 
the determinant of a matrix A, it suffices to approximate the traces of D fc , for k = 1, 2,3... for a 
suitably constructed matrix D. Specifically, IBP99I deals with approximations to SPD matrices A 
of the form A = I„ — qD, where 0 < a < 1 and all eigenvalues of D are in the interval [—1,1]. 
Given such a matrix A, the authors of IBP99I seek to derive an estimator logdet (A) that is close to 
logdet (A). HBP99H proved (using the so-called Martin expansion IMar92l ) that 

m b. oo ju 

log(d e .(A)) = -^^.r(D‘)-^^.r(D‘). 


'http://eigen.tuxfamily.org/ 
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They considered the following estimator: 


logdet (A) = ^ J2 
i =1 


—n 


E 

k =1 


o'" Z 


T n k z, 


T 

Z„ Z; 


Vi 


All Vi for i = 1... p are random variables and the value of p controls the variance of the estimator. 
The algorithm in 1BP99I constructs vectors z, G R n whose entries are independent identically 
distributed standard Gaussian random variables. The above estimator ignores the trailing terms 
of the Martin expansion and only tries to approximate the first m terms. I BP99 1 presented the 
following approximation bound: 


logdet (A) — logdet (A) 


n ■ a m 1 a 2 

— 7 _i_ no f + 1-96 ■ \ , 

(m + 1)(1 - a) V p 


where a 2 is the variance of the random variable Vi. The above bound fails with probability at most 
0.05. 


We now compare the results in llBP99l with ours. First, the idea of using the Martin expan¬ 
sion I Mar92 1 to relate the logarithm of the determinant and traces of matrix powers is present in 
both approaches. Second, the algorithm of liBP99l is applicable to SPD matrices that have special 
structure, while our algorithm is applicable to any SPD matrix. Intuitively, we overcome this lim¬ 
itation of IBP99 I by estimating the top eigenvalue of the matrix in the first step of our algorithm. 
Third, our error bound is much better that the error bound of HBP99L To analyze our algorithm, 
we used the theory of randomized trace estimators of Avron and Toledo HAT11I . which relies on 
stronger measure-concentration inequalities than I1 BP99 I, which uses the weaker Chebyshev's in¬ 
equality. 

A similar idea using Chebyshev polynomials appeared in the paper IPL04I1 : to the best of our 
understanding, there are no theoretical convergence properties of the proposed algorithm. Appli¬ 
cations to Gaussian process regression appeared in I1LZL051IZLLW0811ZL071 . The work of [Reu021 
uses an approximate matrix inverse to compute the n-th root of the determinant of A for large 
sparse SPD matrices. The error bounds in this work are a posteriori and thus not directly compa¬ 
rable to our bounds. 

1HAB141 provides a strong worst-case theoretical result which is, however, only applicable to 
Symmetric Diagonally Dominant (SDD) matrices. The algorithm is randomized and guarantees 

< £ 


n, for a user specified error parameter 
[Mar92 1 as well as ideas from preconditioning 


that, with high probability, logdet (A) — logdet (A) 
e > 0. This approach also uses the Martin expansion 
systems of linear equations with Laplacian matrices I S 1041 . The algorithm of 11 1AB141 runs in 
time O (nnz(A)e -2 log 3 (n) log 2 (nn( A)/e)) . To compare to our approach, we need to combine 
the suboptimal upper bound for T from eqn. |4]) with the bound of eqn. Q. Then, we can run 
Algorithm [3] with input 


e' = 


log(7/c(A)) ’ 
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instead of e to guarantee that the final error of our approximation will be bounded by e'n. Then, 
we can observe that the running time of [ HAB141 depends logarithmically on the condition number 
of the input matrix A, whereas our algorithm has a linear dependency on the condition number. 
Notice, however, that our method is applicable to any SPD matrix while the method in I HAB14I 
is applicable only to SDD matrices; given current state-of-the-art on Laplacian preconditioners it 
looks hard to extend the approach of [ HAB14J to general SPD matrices. 

Independently and in parallel with our work, [ HMS15 I presented an algorithm using Stochas¬ 
tic Chebyshev Expansions for the log-determinant problem. The algorithm is very similar in spirit 
to our approach, using the Chebyshev instead of the Taylor expansion and achieves relative-error 
guarantees for a special class of SPD matrices, namely matrices whose eigenvalues all lie in the 
interval (9i,l — 0\) for some 0 < 6\ < 1/2. As we already discussed, our algorithm also achieves 
a relative error bound under such an assumption; the only difference is that the running time 

of HHMS151 is proportional to y/^log Jj, whereas the running time of our approach (see eqn. (5 >) 

is proportional to gk This slightly improved running time might be due to the use of the Stochas¬ 
tic Chebyshev Expansions. However, importantly, our algorithm works for any SPD matrix, with 
arbitrary eigenvalues. Not surprisingly, the added generality comes with a loss in accuracy and 
the relative error bound becomes an additive error bound. 

Finally, two very recent papers [ HMAS16 ;. SAI16I presented algorithms to approximate the 
logdet of a matrix, highlighting the renewed importance of the topic. The work of HSAI16H presents 
a very novel approach to approximate the logdet of a positive semi-definite matrix, using a ran¬ 
domized subspace iteration approach. To the best of our understanding, the relevant bounds 
in their work (Theorem 2 in [SAI16 D are not directly comparable to our bounds. The work 
of 1 11V1AS16 I follows the lines of [ HMS15 I and leverages the use of Chebyshev approximations to 
propose novel estimators for the trace of a matrix function. Among the many exciting applications 
of the proposed approach is an additive-error approach to approximate the logdet of any square 
non-singular matrix; the algorithm needs as inputs upper and lower bounds for all the singular 
values of the input matrix. Similar to the running time of our additive error algorithm in eqn. (|3|, 
the time complexity of the proposed algorithm depends on the condition number of the input 
matrix (see Corollary 7 of 1HMAS16 1). 

We conclude by noting that common algorithms for the determinant computation assume 
floating point arithmetic and do not measure bit operations. If the computational cost is to be 
measured in bit operations, the situation is much more complicated and an exact computation of 
the determinant, even for integer matrices, is not trivial. We refer the interested reader to ll EGVOO I 
for more details. 


2 Preliminaries 

2.1 Notation 

Let A, B.... denote matrices and let a, b,... denote column vectors. I„ is the nxn identity matrix; 
0 mxn is the m x n matrix of zeros; tr (A) is the trace of a square matrix A; the Frobenius and 

2 Both papers appeared after an earlier version of this paper was posted on ArXiv on March 2015 and cite this earlier 
version of our work. 
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the spectral matrix-norms are: ||A||| = j A % and || A ||2 = max|| x || 2=1 || Ax|| 2 - We denote the 
determinant of a matrix A by det(A) and the (natural) logarithm of the determinant of A by 
logdet (A). We use log x to denote the natural logarithm of x . Finally, given an event £, P [£] 
denotes the probability of the event. 

For an SPD matrix A E M nxn , log [A] is an n x n matrix defined as: log [A] = UDU T , where 
U E R nxn contains the eigenvectors of A and D E K" xn is diagonal with entries being 

log(Ai(A)), log(A 2 (A)),..., log(A n (A)). 

Let x be a scalar variable that satisfies |x| < 1. Then, using the Taylor expansion, 

00 k 

log(l-*) = -£ y- 
k =1 

A matrix-valued generalization of this identity is the following statement. 

Lemma 1. Let A E M nxn be a symmetric matrix whose eigenvalues all lie in the interval (—1,1). Then, 

OO . /j. 

log(I„ -A) = -Ey 

k =1 


2.2 Power method 

The first step in our algorithm for approximating the determinant of an SPD matrix is to obtain 
an estimate for the largest eigenvalue of the matrix. Given an SPD matrix A E M nxn we will 
use the power-method (Algorithm [l| to obtain an accurate estimate of its largest eigenvalue. This 
estimated eigenvalue is denoted by Ai(A). 


Algorithm 1 Power method, repeated q times. 

• Input: SPD matrix A E M nxn , integers q. t > 0 

• For j = 1,... . q 

1. Pick uniformly at random a vector Xg E {+1, -l} n 

2. For i = 1,..., t 

• x? = A • x J 1 

l l—l 

•T 

3. Compute: A{(A) = Xt ,^ Xt 

• Return: Ai(A) = max J= \ q A) (and the corresponding vector x* = x() 


Algorithm [l] requires 0(qt(n + nnz(A))) arithmetic operations to compute Ai(A). Lemma [ 2 ] 
(see IITrellll for a proof) argues that any A{(A) is close to Ai(A). 
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Lemma 2. For any fixed j = 1... q, and for any t > 0, e > 0, with probability at least 3/16, 


(1 -e) 


1 + 4n(l — e) 


2 1 


Ai(A) < 


X' 


Ax? 


x: 


3 3 

t 



Let e = 2.718 ... and let e = 1 — (1/e) and t = [log\/4n~|; then, with probability at least 3/16, for 
any fixed j = 1... q, 

jUi(A)<iMA)<y(A). 

It is now easy to see that the largest value Ai(A) (and the corresponding vector x,) fails to satisfy 
the inequality (l/6)Ai(A) < Ai(A) with probability at most 





<<5, 


where the last inequality follows by setting q = [4.82 log(l/<5)~| > log(l/<5)/log(16/13). Finally, we 
note that, from the min-max principle, Ai(A) < Ai(A). We summarize the above discussion in the 
following lemma. 

Lemma 3. Let Ai(A) be the output of Algor ithm^ivith q = [4.821og(l/<5)] and t = [log y/An\. Then, 
with probability at least 1 — 5, 

pAi(A)<Ai(A)<Ai(A). 

fa 

The running tune of Algorithm^is O ((n + nnz( A)) log(n) log (^)) . 


2.3 Trace estimation 

Even though computing the trace of a square n x n matrix requires only 0(n) arithmetic op¬ 
erations, the situation is more complicated when A is given through a matrix function, e.g., 
A = X 2 , for some matrix X and the user only observes X. For situations such as these, Avron and 
Toledo f ATI ill analyzed several algorithms to estimate the trace of A. Algorithm [2] and Lemma |4] 
present the relevant results from their paper. 

Lemma 4. Let A e M nxn be an SPD matrix, let 0 < e < 1 be an accuracy parameter, and let 0 < 5 < 1 
be a failure probability. If gi, g2, • • •, g P £ W 1 are independent random standard Gaussian vectors, then, 
for p = [201og(2/<f)/e 2 ~|, with probability at least 1 — 5, 


tr (A) 


V 


Ag< 


i=l 


< e tr (A). 


The above lemma is immediate from Theorem 5.2 in IIAT11 II . 
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Algorithm 2 Randomized Trace Estimation 

• Input: SPD matrix A e M nxn , accuracy 0 < e < 1, and failure probability 0 < 5 < 1. 

1. Letp= |~201og(2/<5)/e 2 ~| 

2. Let gi, g 2 ,..., g p be a set of independent Gaussian vectors in M n 

3. Let 7 = 0 

4. Lor i = 1,... ,p 

• 7 = 7 + g J A gi 

5. 7 = 7/p 

• Return: 7 


3 Additive error approximation for general SPD matrices 


Lemma |5]is the starting point of our main algorithm for approximating the determinant of a sym¬ 
metric positive definite matrix. The message in the lemma is that computing the log determinant 
of an SPD matrix A reduces to the task of computing the largest eigenvalue of A and the trace of 
all the powers of a matrix C related to A. 


Lemma 5. Let A e M nxn be an SPD matrix. 
C := I n — B. Then, 

logdet (A) = n 


For any a with Ai(A) < a , define B := 


OO 

log(a) - 

k=1 


tr{C k ) 

k 


A/a and 


Proof. Observe that B is an SPD matrix with ||B II2 < 1- It follows that 

logdet (A) = log(a n det(A/a)) 

= n log(a) + log j JJ Xi (B) 

\i=l 
n 

= nlog(a) + ^log(Aj(B)) 
i= 1 

= nlog(a) +tr (log [B]). 

Here, we used standard properties of the determinant, standard properties of the logarithm func¬ 
tion, and the fact that (recall that B is an SPD matrix), 

n n 

tr (log [B]) = £ A,(lo g [B]) = 5>g(A,(B)). 

i= 1 2— 1 

Now, 

.r(lo s [B]) = tr (log [I„ - (I„ - B)]) = tr (- ± = - f) ^. (6 ) 

V k =1 / k=l 
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The second equality follows by the Taylor expansion because all the eigenvalues of C = I n — B 
are containedjin (0,1) and the last equality follows by the linearity of the trace operator. ■ 

3.1 Algorithm 

Lemma [5] indicates the following high-level procedure for computing the logdet of an SPD matrix 
A: 

1. Compute some a with Ai(A) < a. 

2. Compute C = I n — A/a. 

3. Compute the trace of all the powers of C. 

To implement the first step in this procedure we use the power iteration from the numerical linear 
algebra literature (see Section [Z2j |. The second step is straightforward. To implement the third 
step, we keep a finite number of summands in the expansion YlV=i * r (C fc ). This step is important 
since the quality of the approximation, both theoretically and empirically, depends on the number 
of summands (denoted with m) that will be kept. On the other hand, the running time of the algo¬ 
rithm increases with m. Finally, to estimate the traces of the powers of C, we use the randomized 
algorithm of Section [23] Our approach is described in detail in Algorithm |3j notice that step 7 in 
Algorithm [3] is an efficient way of computing 


logdet (A) := n log (a) - 'g, T C fc g * j /k. 



3.2 Error bound 


The following lemma proves that Algorithm [3] returns an accurate approximation to the logdet of 
A. 


Lemma 6. Let logdet (A) be the output on inputs A, m , and e. Then, with probability at 

least 1 — 25, 




inhere T = ]P " =1 log ^7 • • Ifm > |~7k (A) log ( 7 )], then 


logdet (A) — logdet (A) < 2eT. 


Proof. First, note that using our choice for a in Step 3 of Algorithm [3] and applying Lemma |3} we 
get that, with probability at least 1 — 5, 


Ai(A)<^Ai(A)<a<7A!(A) 

o 


3 Indeed, Ai(C) = 1 — Ai(B) and 0 < A;(B) < 1 for all i = 1... n. 
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Algorithm 3 Randomized Log Determinant Estimation 
1 : INPUT: A G M nxn , accuracy parameter e > 0, and integer m > 0. 

2 : Compute Ai(A) using Algorithm [T] with (integers)/: = O (logn) and q = O (log(l /())') 
3: Pick a = 7Ai(A) 

4: Set C = I n — A/ a 
5: Setp= |~201og(2/5)/e 2 ~| 

6 : Let gi, g 2 ,..., g p G M n be i.i.d. random Gaussian vectors. 

7: For i = 1, 2 ... ,p 

• = Cg, and 7^ = gj v'g’ 

• For k = 2,..., m 

1 v (i) •— fv® 
v fc ■- ^ V fc-1- 

2. 4k = El v /, l) (Inductively 7 ^ = g JC k g t ) 

• EndFor 
8 : EndFor 

9 : OUTPUT: bidit (A) = n log(a) - EL. (\ EL. 7?) A 


The strick inequality at the leftmost side of the above equation follows since all eigenvalues of A 
are strictly positive. Let's call the event that the above inequality holds E\) obviously, P [Pi] > 1 — S 
(and thus P [Pi] < 5). We condition all further derivations on Pi holding and we manipulate 


A = 


logdet (A) — logdet (A) 


as follows: 


A = 


< 


E AsW A-Eh c ‘) a 


, V 

k= 1 i=l 


k =1 
m 


E/ l -En c ‘) a 

k= 1 V i =1 / fc=. 

p / m \ / m \ 

E C ‘A 6-te E C ‘A 


+ 


E tr(c fc )/fe 

fc=m+l 


p 


i= 1 


^fc=l 


^fc=l 


+ 

^ s. 


E tr ( c ") /fc 

fc=ra+l 


A 2 


Below, we bound the two terms Ai and A 2 separately. We start with Ap the idea is to apply 
Lemma 4 on the matrix ELi with p = \20 log(2 /5)/ e 2 ]. Let £2 denote the probability that 
: holds; obviously, P [P 2 ] >1 — 5 (and thus P [P 2 ] < 5) given our choice of p. We condition 


Lemma 4 


all further derivations on £2 holding as well to get 

/ m \ / 00 N 

Ai < e • tr E Qk / k < e • tr E ° k ! k 


\k =1 


^fc=l 
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In the last inequality we used the fact that C is a positive matrix, hence for all k, tr (■ C k ) > 0. The 
second term A 2 is bounded as follows: 


Ao = 


E tr ( Ck )/k 


k=m-\-l 
00 


< E ^( ck )/k 


k=m -\-1 


E tr (C m • /k < E l|C m || 2 -tr(c fc - m 


/k 


= IIC r 


OO OO 

E tr /k < ||C m || 2 • E tr (° fc ) A 


k=m +1 


k= 1 


a 


m 00 




k =1 


In the first inequality, we used the triangle inequality and the fact that C is a positive matrix. In 
the second inequality, we used the following faclQ given two positive semidefinite matrices A, B 
of the same size, tr (AB) < || A|| 2 • tr (B). In the last inequality, we used the fact that 

Ai(C) = 1 - A„(B) = 1 - X n (A)/a. 

Combining the bounds for Ai and A 2 gives 


logdet (A) — logdet (A) < ( e + 


1 _MA)\”\ “ tr (C*) 

a ' ' k =1 ^ 


We have already proven in Lemma [5] that 

00 tr (C k ) 

E — J ,— = —tr ( lo § t B D = nlog(a) - logdet (A). 
k =1 

Notice that the assumption of Lemma [5] (namely, Ai(A) < a) is satisfied from the inequality of 
eqn. ([7jl. We further manipulate the last term as follows: 


ralog(a) — logdet (A) = nlog(a) — MllMA)) 

1=1 

n 

= nlog(a) - E lo g(A*( A )) 


1=1 


Eo°g(«) - io g(^(A))) 
2=1 
n 

E lc g 


2=1 


Aj(A) J ' 


4 This follows from Von Neumann's trace inequality. 
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Collecting our results together, we get: 


logdet (A) — logdet (A) 


< e+ 1 - 


>(A) 

a 


E 

i =1 


log 


a 

A (A) 


Using the inequality of eqn. (|7]> (only the upper bound on a is needed here) proves the first inequal¬ 
ity of the lemma. To prove the second inequality, we use the well-known fact that (1 
(where e = 2.718 ... and x > 0) and our choice for m. 

Finally, recall that we conditioned all derivations on events £\ and £2 both holding, which can 
be bounded as follows: 


P [£ x n £ 2 ] = 1 - P [£i US 2 } > 1 - P [Si] - P [£ 1 ] >1-26. 


The first inequality in the above derivation follows from the union bound. 


3.3 Running time 

Step 2 takes 0(nnz(A) log(n) log(l/<5)) time; we assume that nnz{ A) > n, since otherwise the 
determinant of A would be trivially equal to zero. For each k > 0, v& = C^g,. The algorithm 
inductively computes v*. and g, C k g, = g J v/, ; for all k = 1,2 ,m. Given v/,_ 1 , v/, and g, C k g , 
can be computed in nnz( C) and O(n) time, respectively. Notice that nnz(G) < n + nnz(A). 
Therefore, step 7 requires 0(p ■ m ■ nnz(A)) time. Since p = (){e 2 log(l/d)). the total cost is 

O ^nnz(A) • (J^ + logn) ■ log . 


4 Relative error approximation for SPD matrices with bounded eigen¬ 
values 

In this section, we argue that a simplified version of Algorithm [3] achieves a relative error approx¬ 
imation to the logdet of the SPD matrix A, under the assumption that all the eigenvalues of A lie 
in the interval ( 6 \. 1), where 0 < 0\ < 1. This is a mild generalization of the setting introduced 
in HHMS15I . 

Given the upper bound on the largest eigenvalue of A, the proof of the following lemma 
(which is the analog of Lemma [5]) is straightforward. 

Lemma 7. Let A G M nxn be an SPD matrix zvhose eigenvalues lie in the interval ( 61 ,1), for some 0 < 
6 \ < 1. Let C := I n — A; then, 

™ tr ( C k ) 

logdet (A) = - — j ,— • 

fc=i 

Proof. Similarly to the proof of Lemma |5j 

( n \ n 

n M A ) = ^log(Aj(A)) = tr(log[A]). 
i=1 J i=l 
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Now, 


tr (tog [A]) = tr (log [I„ - (I„ - A)]) = tr (- f) ) = _ £ LM, 

\ k=1 ' J k= 1 

The second equality follows by the Taylor expansion since all the eigenvalues of C = I n — A are 
contained in the interval (0,1). ■ 


4.1 The algorithm and the relative error bound 

We simplify Algorithm [3] as follows: we skip steps 2 and 3 and in step 4 we set C = I n — A. The 
following lemma proves that in this special case the modified algorithm returns a relative error 
approximation to the log determinant of the input matrix A. 

Lemma 8. Let logdet (A) be the output of the (modified) Algorithm [i] on inputs A and e. Then, ivith 
probability at least 1 — 5, 


logdet (A) — logdet (A) 


< 2e • |logdet (A) | . 


Proof. Similarly to the proof of Lemma joj we manipulate A = logdet (A) — logdet (A) 


as follows: 


A = 


< 


E ;E& T °‘s /* 


. p 

k =1 i—1 


k=1 
m 


E A& Tcl ® A-N tr ( c ‘) / k 

k =1 V i =1 / k =1 

p / m \ / rn \ 

^E*. 7 Ewda-tr E c ‘/q 


P 


+ 


+ 


i =1 \fc=l 


\k =1 


E tr ( Ck )/k 

k=m -\-1 
oo 

E tr ( cfc )/ fc 

fc=m+l 

■ v 

A 2 


We now bound the two terms Ai and A 2 separately We start with Ai: the idea is to apply 
Lemma |i on the matrix 1 C fc /A: with p = [20 log(2/f))/c 2 ]. Hence, with probability at least 
1 — 5 (this is the only probabilistic event in this lemma and hence 1 — 5 is a lower bound on the 
success probability of the lemma): 


Ai < e • tr E Qk / k < £ ■ tr E ° k / k J ' 


\k= 1 


^fc=l 


In the last inequality we used the fact that C is a positive definite matrix, hence for all k, tr (C fc ) > 
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0. Bounding A 2 follows the lines of the proof of Lemma [ 6 ] 


A2 = 


< 


E tr ( ck ) / k 

k=m-\-l 


E tr(C m -C k - m j /k 

k=m -\-1 


E ||C m ||2 ■ tr yC k ~ m J /k 

k=m -\-1 


E tr(c k ~ m J/k 

k=m+l 


2 ' 


E tr c ") A 


k= 1 


< (1-An (A)) ? 


E tr c ") A 


fc=l 


In the last inequality, we used the fact that Ai(C) = 1 — A n (A). Combining the bounds for Ai and 
A 2 gives 


_ 00 tr (C k ) 

logdet (A) - logdet (A) < (e + (1 - X n (A)) m ) • E — t —• 


k =1 


We have already proven in Lemma [7] that 


E 

k=l 


tr(C fe ) 


k 


= -tr (log [A]) = -logdet (A). 


Collecting our results, we get: 

logdet (A) - logdet (A) < (e + (1 — A n (A)) m ) • | logdet (A)| . 
Using 1 — A n (A) < 1 — 6 * 1 , we conclude that 

logdet (A) — logdet (A) < (e + (1 — 9i) m ) ■ (logdet (A)| . 

Setting 


m = 


7T ' lo § — 


and using (l — x 1 ) J < e 1 (where e = 2.718... and x > 0), guarantees that (1 — 9i) m < e and 
concludes the proof of the lemma. ■ 

We conclude by discussing the running time of the simplified Algorithm |3j which is equal to 


0(p ■ m ■ nnz(A)). Since p = O 


and m = O ( log ^y £ -* 


, the running time becomes 


O 


/log(l/e)log(l/ 6 ) 




e 2 9i 


nnz(A) 
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5 Experiments 


The goal of our experimental section is to establish that our approximation to logdet (A) (as com¬ 
puted by Algorithm [3]) is both accurate and fast for both dense and sparse matrices. The accuracy 
of Algorithm [3] is measured by comparing its result against the exact logdet (A) computed via the 
Cholesky factorization. The rest of this section is organized as follows: in Section |5d| we describe 
our software for approximating logdet (A); in Section 5.2 we describe the computational environ¬ 
ment that we used; and in Sections 5.3 and |5.4| we discuss experimental results for dense and 
sparse SPD matrices, respectively. 


5.1 Software 

We developed high-quality, shared- and distributed-memory parallel C++ code for the algorithms 
listed in this paper. All of the code that was developed for this paper is hosted at https: 
//github . com/pkambadu/ApproxLogDet In it's current state, our software supports: (1) 
ingesting dense (binary and text format) and sparse (binary, text, and matrix market format) 
matrices, (2) generating large random SPD matrices, (3) computing both approximate and ex¬ 
act spectral norms of matrices, (4) computing both approximate and exact traces of matrices, and 
(5) computing both approximate and exact log determinants of matrices. Currently, we support 
both Eigen [GJ + 10| and Elemental flPMVdG + 13l matrices. The Eigen software package supports 
both dense and sparse matrices, while the Elemental software package mostly supports dense 
matrices and only recently added support for sparse matrices (pre-release). As we wanted the 
random SPD generation to be fast, we have used parallel random number generators from Ran¬ 
domly f SMDSli l in conjunction with Boost.Random. 


5.2 Environment 

All our experiments were run on "Nadal", a 60-core machine, where each core is an Intel® Xeon(R) 
E7-4890 machine running at 2.8 Ghz. Nadal has 1 TB of RAM and runs Linux kernel version 2.6- 
32. For compilation, we used GCC 4.9.2. We used Eigen 3.2.4, OpenMPI 1.8.4, Boost 1.55.7, and the 
latest version of Elemental at https : / /github. com/elemental For experiments with Elemental, 
we used OpenBlas, which is an extension of GotoBlas 1GVDG08I , for its parallel prowess; Eigen 
has built-in the BLAS and LAPACK packages. 


5.3 Dense Matrices 

Data Generation. In our experiments, we used two types of synthetic SPD matrices. The first 
type were diagonally dominant SPD matrices and were generated as follows. First, we created 
X E M nxn by drawing n 2 entries from a uniform sphere with center 0.5 and radius 0.25. Then, we 
generated a symmetric matrix Y by setting 

Y = 0.5* (X + X T ). 

Finally, we ensured that the desired matrix A is positive definite by adding the value n to each 
diagonal entry lCur09l of Y: A = Y + nl„. We call this method randSPDDenseDD. 
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The second approach generates SPD matrices that are not diagonally dominant. We created 
X, D E by drawing n 2 and n entries, respectively, from a uniform sphere with center 0.5 and 
radius 0.25; D is a diagonal matrix with small entries. Next, we generated an orthogonal random 
matrix Q = qr (X). Thus, Q is an orthonormal basis for X. Finally, we generated A = QDQ r . We 
call this method randSPDDense. randSPDDense is more expensive than randSPDDenseDD, 
as it requires an additional 0(n 3 ) computations for the QR factorization and the matrix-matrix 
product. 

Evaluation. To evaluate the runtime of Algorithm [3] against a baseline, we used the Cholesky de¬ 
composition to compute the logdet (A). More specifically, we computed A = LL V and returned 
logdet (A) = 2 • logdet (L). Since Elemental provides distributed and shared memory parallelism, 
we restricted ourselves to experiments with Elemental matrices throughout this section. Note that 
we measured the accuracy of the approximate algorithm in terms of the relative error to ensure 
that we have numbers of the same scale for matrices with vastly different values for logdet (A); 
we defined the relative error e as e = 100 (x — x)/x, where x is the true value and x is the approxi¬ 
mation. Similarly, we defined the speedup s as s = t x /tx, where t x is the time needed to compute 
x and tx is the time needed to compute the approximation x. 

Results. For dense matrices, we first used synthetic matrices generated using randSPDDense; 
these are relatively ill-conditioned matrices. We experimented with values of n (number of rows 
and columns of A) in the set {5, 000, 7, 500, 10,000, 12, 500, 15,000}. The three key points per¬ 
taining to these matrices are shown in Figure |T| First, we discuss the effect of m, the number 
of terms in the Taylor series used to approximate logdet (A); Figure [1(a)] depicts our results for 
the sequential case. On the y- axis, we see the relative error, which is measured against the exact 
logdet (A) as computed via the Cholesky factorization. We observe that for these ill-conditioned 
matrices, for small values of m (less than four) the relative error is high. However, for all values of 
m > 4, we observe that the error drops significantly and stabilizes. We note that in each iteration, 
all random processes were re-seeded with new values; we have plotted the error bars throughout 
Figure [Tj The standard deviation for both accuracy and time was consistently small; indeed, it is 
not visible to the naked eye at scale. To see the benefit of approximation, we look at Figure |l(b)| 
together with Figure |l(a)| For example, at m = 4, for all matrices, we get at least a factor of two 
speedup. As n gets larger, the speedups of the approximation also increase. For example, for 
n = 15, 000, the speedup at m = 4 is nearly six-fold. In terms of accuracy. Figure [1(a)] shows that at 
m = 4, the relative error is approximately 4%. This speedup is expected as the Cholesky factoriza¬ 
tion requires 0(n 3 ) operations; Algorithm [ 3 ] only relies on matrix-matrix products where one of 
the matrices has a small number of columns (equal to p), which is independent of n. Finally, we 
discuss the parallel speedup in Figure |l(c)| which shows the relative speedup of the approximate 
algorithm with respect to the baseline Cholesky algorithm. For this evaluation, we set m = 4 and 


is that the approximate algorithm provides nearly the same or increasingly better speedups rel¬ 
ative to a parallelized version of the exact (Cholesky) algorithm. For example, for n = 15,000, 
the speedups for using the approximate algorithm are consistently better that 6.5x. The absolute 
values for logdet (A) and timing along with the baseline numbers for this experiment are given in 
Table [TJ We report the numbers in Table [I] at m = 4 at which point, we have low relative error. 

For the second set of dense experiments, we generated diagonally dominant matrices using 


varied the number of processes, denoted by np, from 1 to 60. The main take away from Figure 1(c) 
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(a) Accuracy vs. m 


(b) Speedup vs. m 


(c) Parallel Speedup 


Figure 1: Panels 1(a) and 1(b) depict the effect of m (see Algorithm [3j on the accuracy of 
the approximation and the time to completion, respectively, for dense matrices generated by 
randSPDDense. For all the panels, p = 6 0 and t = 2 log \/4n. The baseline for all experiments 
was the Cholesky factorization, which was used to compute the exact value of logdet (A). For 
panels [1(a)] and |T(b)} the number of cores, n/p, was set to one. The last panel [1(c)] depicts the rel¬ 
ative speedup of the approximate algorithm when compared to the baseline solver (at m = 4). 
Elemental was used as the backend for these experiments. For the approximate algorithm, we 
report the mean and standard deviation of ten iterations. 


n 

logdet (A) 

time (secs) 

exact 

mean 

std 

exact 

mean 

std 

5000 

-3717.89 

-3546.920 

8.10 

2.56 

1.15 

0.0005 

7500 

-5474.49 

-5225.152 

8.73 

7.98 

2.53 

0.0015 

10000 

-7347.33 

-7003.086 

7.79 

18.07 

4.47 

0.0006 

12500 

-9167.47 

-8734.956 

17.43 

34.39 

7.00 

0.0030 

15000 

-11100.9 

-10575.16 

15.09 

58.28 

10.39 

0.0102 


Table 1: Accuracy and sequential running times (at p = 60, m = 4 and t = log \/ 4n) for dense 
random matrices generated using randSPDDense. Baselines were computed using the Cholesky 
factorization; mean and standard deviation are reported over ten iterations. 


randSPDDenseDD; we were able to quickly generate and run benchmarks on matrices of sizes 
n x n with n in the set {10,000, 20,000, 30,000, 40, 000} due to the relatively simpler procedure 
involved in matrix generation. In this set of of experiments, due to the diagonal dominance, 
all matrices were well-conditioned. The results of our experiments on these well-conditioned 
matrices are presented in Figure [2] and show a marked improved over the results presented in 
Figure [T| First, notice that very few terms of the Taylor series (i.e., small m) are sufficient to get 
high accuracy approximations; this is apparent in Figure |2(a)| In fact, we see that even at m = 2, 
we are near convergence and at m = 3, for most of the matrices, we have near-zero relative error. 
This experimental result, combined with Figure |2(b)| is particularly encouraging; at rn = 2, we 
seem to not only have a nearly lossless approximation of logdet (A), but also have at least a five¬ 
fold speedup. Similarly to Figure [TJ the speedups are better for larger matrices. For example, for 
n = 40,000, the speedup at m = 2 is nearly twenty-fold. We conclude our analysis by presenting 
Figure |2(c)j which similarly to Figure |l(c)| points out that at any level of parallelism. Algorithm [3] 


17 





















































NUMBER OF TAYLOR TERMS (m) NUMBER OF TAYLOR TERMS (m) NUM PROCESSORS 


(a) Accuracy vs. m 


(b) Speedup vs. m 


(c) Parallel Speedup 


Figure 2: Panels 2(a) and 2(b)| depict the effect of m (see Algorithm [3jl on the accuracy of the 
approximation and the time to completion, respectively, for diagonally dominant dense random 
matrices generated by randSPDDenseDD. For all the panels, p = 60 and t = 21og\/4n. The 
baseline for all experiments was the Cholesky factorization, which was used to compute the ex¬ 
act value of logdet (A). For panels 2(a) and 2(b) the number of cores, np, was set to one. The 
last panel |2(c)| depicts the relative speedup of the approximate algorithm when compared to the 
baseline solver (at m = 2). Elemental was used as the backend for these experiments. For the 
approximate algorithm, we report the mean and standard deviation over ten iterations. 


n 

logdet (A) 

time (secs) 

exact 

mean 

std 

exact 

mean 

std 

10000 

92103.1 

92269.5 

5.51 

18.09 

2.87 

0.01 

20000 

198069.0 

198397.4 

9.60 

135.92 

12.41 

0.02 

30000 

309268.0 

309763.8 

20.04 

448.02 

30.00 

0.12 

40000 

423865.0 

424522.4 

14.80 

1043.74 

58.05 

0.05 


Table 2: Accuracy and sequential running times (at p = 60, m = 2, and t = 2 log \/4n) for 
diagonally dominant dense random matrices generated using randSPDDenseDD. Baselines were 
computed using the Cholesky factorization; mean and standard deviation are reported over ten 
iterations. 


maintains its relative performance over the exact (Cholesky) factorization. The absolute values for 
logdet (A) and the corresponding running times, along with the baseline for this experiment are 
presented in Table [2] We report the numbers in Table [T| at m = 2, at which point we have a low 
relative error. 

5.4 Sparse Matrices 

Data Synthesis. To generate a sparse, synthetic matrix A E M nxn , with nnz non-zeros, we use a 
Bernoulli distribution to determine the location of the non-zero entries and a uniform distribution 
to generate the values. First, we completely fill the n principle diagonal entries. Next, we generate 
{nnz — to)/ 2 index positions in the upper triangle for the non-zero entries by sampling from a 
Bernoulli distribution with probability ( nnz—n ) / (n 2 — n). We reflect each entry across the principle 
diagonal to ensure that A is symmetric and we add n to each diagonal entry to ensure that A is 
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name 

n 

nnz 

area of origin 

logdet (A) 

time (sec) 

m 

exact 

approx 

exact 

approx 

mean 

std 

mean 

thermal2 

1228045 

8580313 

Thermal 

1.3869e6 

1.3928e6 

964.79 

31.28 

31.24 

149 

ecology2 

999999 

4995991 

2D/3D 

3.3943e6 

3.403e6 

1212.8 

18.5 

10.47 

125 

ldoor 

952203 

42493817 

Structural 

1.4429e7 

1.4445e7 

1683.5 

117.91 

17.60 

33 

thermomech.TC 

102158 

711558 

Thermal 

-546787 

-546829.4 

553.12 

57.84 

2.58 

77 

boneSOl 

127224 

5516602 

Model reduction 

1.1093e6 

1.106e6 

247.14 

130.4 

8.48 

125 


Table 3: Description of the SPD matrices from the University of Florida sparse matrix collec¬ 
tion [ PHI 1 1 that were used in our experiments. All experiments were run sequentially (np = 1) 
using Eigen. Accuracy results for Algorithm [3] are reported using both the mean and the standard 
deviation over ten iterations at (with t = 5 and p = 5); we only report the mean for the running 
times, since the standard deviation is negligible. The exact logdet (A) was computed using the 
Cholesky factorization. 


SPD (actually, A is also diagonally dominant). 

Real Data. To demonstrate the prowess of Algorithm [3] on real-world data, we used SPD matrices 
from the University of Florida's sparse matrix collection f DHll l. The complete list of matrices 
from this collection used in our experiments, as well as a brief description of each matrix, is given 
in columns 1-4 of Table |U 

Evaluation. It is tricky to pick any single method as the "exact method" to compute the logdet (A) 
for a sparse SPD matrix A. One approach would be to use direct methods such as Cholesky de¬ 
composition of A IDav06i. GupOO) . For direct methods, it is difficult to derive an analytical solution 
for the number of operations required for the factorization as a function of the number of non-zero 
entries of the matrix, as this is highly dependent on the structure of the matrix I GKK97 I. In the 
distributed setting, one also needs to consider the volume of communication involved, which is 
often the bottleneck. Alternately, we can use iterative methods to compute the eigenvalues of 
A IIDav75l and use the eigenvalues to compute logdet (A). It is clear that the worst case perfor¬ 
mance of both the direct and iterative methods is 0(n 3 ). However, iterative methods are typically 
used to compute a few eigenvalues and eigenvectors: therefore, we chose to use the Cholesky fac¬ 
torization based on matrix reordering to compute the exact value of logdet (A). It is important to 
note that both the direct and iterative methods are notoriously hard to implement, which comes 
to stark contrast with the almost trivial implementation of Algorithm [3j which is also readily par- 
allelizable. 

Results. The true power of Algorithm [3] lies in its ability to approximate logdet (A) for sparse A. 
The Cholesky factorization can introduce 0(n 2 ) non-zeros during factorization due to fill-in; for 
many problems, there is insufficient memory to factorize a large, sparse matrix. In our first set of 
experiments, we wanted to show the effect of m on: (1) convergence of logdet (A), and (2) cost of 
the solution. To this end, we generated sparse, diagonally dominant SPD matrices of size n = 10 6 
and varied the sparsity from 0.1% to 1% in increments of 0.25%. We did not attempt to compute 
the exact logdet (A) for these synthetic matrices — our aim was to merely study the speedup 
with m for different sparsities, while t and p were held constant at 2 log \/ 4n and 60 respectively. 
The results are shown in Figure [3] Figure [3(a)] depicts the convergence of logdet (A) measured as a 
relative error of the current estimate over the final estimate. As can be seen — for well conditioned 
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(a) Convergence with m 


(b) Cost as a function of m 


Figure 3: Panels |3(a)| and |3(b)| depict the effect of the number of terms in the Taylor expan¬ 
sion, m, (see Algorithm [3} on the convergence to the final solution and the time to comple¬ 
tion of the approximation. The matrix size was fixed at n = 10 6 and sparsity was varied as 
0.1%, 0.25%, 0.5%, 0.75 %, an d 1%. Experiments were run sequentially (np = 1) and we set p = 60, 
t = 2 log yAn. For panel [3(a)} the baseline is the final value of logdet (A) at m = 25. For panel [3(b)} 
the baseline is the time to completion of the approximation algorithm with m = 1. Eigen was used 
as the backend for these experiments. 


matrices — convergence is quick. Figure |3(b)| shows the relative cost of increasing m; here the 
baseline is m = 1. Therefore, the additional cost incurred by increasing m is linear when all other 
parameters are held constant. 

The results of running Algorithm [3} on the UFL matrices are shown in Table [3] The numbers 
reported for the approximation are the mean and standard deviation over ten iterations, t = 5, 
and p = 5 [^J The value of m was varied between one and 150 in increments of five to select the 
best average accuracy. The matrices shown in Table [3} have a nice structure, which lends itself to 
nice reorderings and therefore an efficient computation of the Cholesky factorization. We see that 
even in such cases, the performance of Algorithm [3} is commendable due to its lower algorithmic 
complexity; ldoor is the only exception as the approximation takes longer to compute than the 
Cholesky factorization. In the case of thermomech.TC, we achieve good accuracy while achieving 
a 22x speedup. 

6 Conclusions 

Prior work has presented approximation algorithms for the logarithm of the determinant of a sym¬ 
metric positive definite matrix; those algorithms either do not work for all SPD matrices, or do not 
admit a worst-case theoretical analysis, or both. In this work, we presented an approximation 
algorithm to compute the logarithm of the determinant of a SPD matrix that comes with strong 
theoretical worst-case analysis bounds and can be applied to any SPD matrix. A simplification of 
our algorithm delivers relative-error approximation guarantees for a popular special case of SPD 

5 We experimented with different p, t and settled on the smallest values that did not result in loss in accuracy. 
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matrices. Using state-of-the-art C ++ numerical linear algebra software packages for both dense and 
sparse matrices, we demonstrated that the proposed approximation algorithm performs remark¬ 
ably well in practice in serial and parallel environments. 
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